Software and Package Requirements:



1. STATA version 13.1, including the ivreg2 (or xtivreg2) command for IV regressions and diagnostic statistics. To install, type: ssc install ivreg2. Help for the command is available at: http://www.repec.org/bocode/i/ivreg2.html.



2. MATLAB 2015b (including the statistical toolbox) - for data processing and model generation.



3. R 3.1.2 (including glmnet package) - for LASSO instrument selection.

First load the library: library(glmnet). Main command: cvfit = cv.glmnet(x,y,penalty.factor = p.fac), where y is the dependent variable (in our case the endogenous variable) and x includes the set of potential instruments (in our case the weather binary indicators), which are penalized (penalty.factor=1), as well as all the controls, which are not penalized (penalty.factor=0). See recommendations at: http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html.







*NOTE*

Our results are based exclusively on individual-level exercise data. However, the individual-level personal data tables have been redacted for legal and privacy reasons. See the note in subsection S4.8 of the supplementary materials.





In what follows we describe each data file and each script/code:



*Note that all the files start with LETTERNUMBER_ in order to be sorted alphabetically in the right analysis order.*



********************************RAW DATA************************************



*exercise data (only headers - redacted for legal reasons):



App_Users_in_Graph_demographics.txt: contains the demographic information of the users who have at least one running buddy (network-embedded users). Demographics include the user unique identifier, year of birth, gender, weight, height, join date, device type and the address at the time of registration, which includes city, state, country and ZIP code (or postcode). We extract the exact geographic location of users in Latitude and Longitude either from the running GPS information in the data file GPS_summary.txt or, in the absence of GPS tracking, from the address information using the Google Maps API (https://developers.google.com/maps/).



SPA_Social_graph.txt: contains the running activity records of the network-embedded individuals. Each exercise activity is a row in the dataset. It includes the unique identifier of the runner, the type of activity (run, walk, etc.), the duration of the activity, the distance, the calories burned, the local start time of the event and the timezone of the event. The Datenumber of the event is the number of days elapsed since January 0, 0000 (Matlab's internal date representation).
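
For readers who work outside Matlab, the Datenumber convention can be illustrated with a short sketch. It is written in Python and is not part of the replication scripts. The function name is hypothetical, and the 366-day offset is the difference between Matlab's day count and Python's proleptic Gregorian ordinal.

```python
from datetime import datetime, timedelta

# Matlab datenum 1 corresponds to January 1, 0000, while Python ordinal 1
# corresponds to January 1, 0001; year 0000 counts as a leap year (366 days).
MATLAB_TO_PYTHON_OFFSET_DAYS = 366


def datenum_to_datetime(datenum):
    """Convert a Matlab datenumber (possibly fractional) to a Python datetime."""
    whole_days = int(datenum)
    day_fraction = datenum - whole_days
    return (datetime.fromordinal(whole_days)
            + timedelta(days=day_fraction)
            - timedelta(days=MATLAB_TO_PYTHON_OFFSET_DAYS))


print(datenum_to_datetime(730486))    # 2000-01-01 00:00:00
print(datenum_to_datetime(730486.5))  # 2000-01-01 12:00:00
```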



GPS_summary.txt: gives the exact starting geographic location in Latitude and Longitude of an exercise activity. Information includes the activity unique identifier, the provider of the GPS tracking (creator), the starting Latitude and Longitude as well as the user unique identifier. We use this data along with the address information to extract the geographic location of running. 



USERREL.txt: contains the social network of the individuals (running-buddy network). It includes the two individuals that form each link as well as a timestamp for the date on which the link was created.









*weather data (Source: National Oceanic and Atmospheric Administration): 



The following are cleaned Matlab structure files with daily precipitation and temperature data worldwide for a period of about 5 years. 



WeatherStations.mat: geographic location data for the weather stations that we use in our analysis. Headers: station id, latitude, longitude, whether rain data is available (binary), whether temperature data is available (binary). Note that the first two letters of the station unique identifier are the 2-letter country abbreviation code. For example, the unique identifier USXXXXXXX refers to a weather station in the United States. See countries.txt for a complete list of 2-letter abbreviations.



PRECIPITATION.mat: Matlab data structure with daily precipitation data for the weather stations under consideration. PRECIPITATION(1).STATION is the first station id, and PRECIPITATION(1).PRCP holds the precipitation values in tenths of mm for a period of about seven years. The first column of PRECIPITATION(1).PRCP is the date (using Matlab's datenumbers) and the second column is the precipitation value. Negative values are treated as missing data. NaN values mark days for which no data are available.



TMAX_DATA.mat: Matlab structure with maximum daily temperature data for the weather stations under consideration. TMAX_DATA(1).STATION is the first station id, and TMAX_DATA(1).TMAX holds the temperature values in tenths of a degree Celsius for a period of about seven years. The first column of TMAX_DATA(1).TMAX is the date (using Matlab's datenumbers) and the second column is the maximum daily temperature value. Values of -9999 or -999 are treated as missing data. NaN values mark days for which no data are available.



TMIN_DATA.mat: Matlab structure with minimum daily temperature data for the weather stations under consideration. TMIN_DATA(1).STATION is the first station id, and TMIN_DATA(1).TMIN holds the temperature values in tenths of a degree Celsius for a period of about seven years. The first column of TMIN_DATA(1).TMIN is the date (using Matlab's datenumbers) and the second column is the minimum daily temperature value. Values of -9999 or -999 are treated as missing data. NaN values mark days for which no data are available.



We are also releasing the raw data files that we use to create the compact Matlab data structures (Date of access: Sep 8, 2015). The csv files are organized by chronological period and by meteorological indicator. For example, PRCP.csv includes all the daily precipitation values recorded over a period of about seven years. The first column is the station unique identifier (see the stations.txt file for the location of these stations), the second column is the date in the format yyyymmdd and the third column is the recorded value. The recorded values are in tenths of mm for precipitation and tenths of a degree Celsius for temperature. Values of -9999 or -999 are treated as missing data.
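
The row layout described above can be read with a few lines of code. The sketch below is in Python and is only an illustration, not the authors' Matlab preprocessing. It assumes comma-separated rows with exactly three columns, no header row and integer values. The function name and the sample rows are made up.

```python
import csv
import io
from datetime import datetime

MISSING_CODES = {-9999, -999}  # sentinel values described in the text


def parse_weather_rows(lines):
    """Yield (station_id, date, value) with value in mm or degrees Celsius.

    Missing observations are returned as None.
    """
    for station_id, yyyymmdd, raw in csv.reader(lines):
        day = datetime.strptime(yyyymmdd.strip(), "%Y%m%d").date()
        tenths = int(raw)
        # Recorded values are in tenths of a unit, so divide by 10.
        value = None if tenths in MISSING_CODES else tenths / 10.0
        yield station_id.strip(), day, value


# Made-up rows for illustration only (not taken from the released files).
sample = io.StringIO("USXXXXXXX,20150908,125\nUSXXXXXXX,20150909,-9999\n")
rows = list(parse_weather_rows(sample))
print(rows)
```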

******************************************************************************















********************************ANALYSIS**************************************





***Preprocessing of the raw data*** 





The script Clean_social_graph_including_only_people_under_consideration.m isolates the part of the social graph for which geographic location information is available for all App users, and assigns to each user a number that is the position of his/her record in the App_Users_in_Graph_demographics.csv data file. It also computes the correlation between the weather that the ego and the friend experience in each link.



inputs:  App_Users_in_Graph_demographics.csv

         USERREL.csv



outputs: USERREL_USEDFOR_SOCIAL_INFLUENCE.csv (only headers-redacted for legal reasons)

         USERREL_USEDFOR_SOCIAL_INFLUENCE_wth_correlations.csv (only headers-redacted for legal reasons)


Each row of the USERREL_USEDFOR_SOCIAL_INFLUENCE.csv file is the same as in USERREL.csv, kept only if the location of both users in the link is identified, with two additional columns that give the position of each user in the App_Users_in_Graph_demographics.csv data file (user number). In addition, USERREL_USEDFOR_SOCIAL_INFLUENCE_wth_correlations.csv includes the correlation between the weather that the two users in each link experience.






*the script Create_Activity_matrices.m generates the daily running activity for each individual under consideration.



Input data files: SPA_Social_graph.csv

	     	  App_Users_in_Graph_demographics.csv

	     	  USERREL.csv



Output data files: run_mat.mat

                   distance_mat.mat

                   duration_mat.mat

                   calories_mat.mat

                   pace_mat.mat

                   StartTime_mat.mat

                   TimeZone_mat.mat (all redacted for legal reasons)



All of these matrices have the same size: (number of days) x (number of individuals). For instance, distance_mat(1,1000) is the distance that the individual with user number 1000 ran on the 1st day under consideration. User number 1000 refers to the individual in the 1000th row of the App_Users_in_Graph_demographics.csv data file. StartTime_mat gives the local start time of the daily activity (e.g. 10.5 means 10:30am), and TimeZone_mat gives the timezone of the daily activity; for example, -4 means GMT-4:00 and 0 means Greenwich time, GMT+0:00.
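
The start-time and timezone encodings can be made concrete with a small sketch. It is written in Python for illustration only and is not part of the Matlab scripts. Both function names are hypothetical.

```python
def decimal_hour_to_clock(start_time):
    """Convert a decimal local start time (e.g. 10.5) to an 'HH:MM' string."""
    total_minutes = int(round(start_time * 60))
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours:02d}:{minutes:02d}"


def format_timezone(offset_hours):
    """Render a timezone offset such as -4 as 'GMT-4:00'."""
    sign = "+" if offset_hours >= 0 else "-"
    total_minutes = int(round(abs(offset_hours) * 60))
    hours, minutes = divmod(total_minutes, 60)
    return f"GMT{sign}{hours}:{minutes:02d}"


print(decimal_hour_to_clock(10.5))  # 10:30
print(format_timezone(-4))          # GMT-4:00
print(format_timezone(0))           # GMT+0:00
```

Note that Matlab indexing is 1-based, so distance_mat(1,1000) corresponds to row index 0 and column index 999 in a 0-based language such as Python.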





*the script Assign_Weather_to_Individuals.m assigns Weather (precipitation and temperature) to each individual



input data files: App_Users_in_Graph_demographics.csv

            	  PRECIPITATION.mat

             	  TMAX_DATA.mat

             	  WeatherStations.mat

It also requires the Matlab function Distance.m, which calculates the distance in km between two geographic coordinates given in Latitude and Longitude.



Output data files:  PRECIPITATION_mat.mat (redacted for legal reasons)

                    TMAX_mat.mat (redacted for legal reasons)

The output files have size (number of days) x (number of individuals). For example, PRECIPITATION_mat(1,1000) is the precipitation that the individual with user number 1000 experiences on the 1st day under consideration.
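
The distance calculation can be sketched as follows. This Python example is an illustration under stated assumptions, not a port of Distance.m. It assumes a great-circle (haversine) distance with a mean Earth radius of 6371 km, and it assumes that each user is matched to the closest station, which the text above does not state explicitly. The station ids and coordinates are made up.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest_station(user_lat, user_lon, stations):
    """Return the id of the station closest to the user.

    `stations` is a list of (station_id, latitude, longitude) tuples.
    """
    return min(stations,
               key=lambda s: haversine_km(user_lat, user_lon, s[1], s[2]))[0]


# Hypothetical station ids and coordinates, for illustration only.
stations = [("US0000001", 42.36, -71.06), ("US0000002", 40.71, -74.01)]
print(nearest_station(42.37, -71.11, stations))  # US0000001
```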



***********************************











***Exercise Influence Model - Figure 2*** 





*the script exercise_influence_model.m generates the ego-level model data tables, as described in detail in sections S2.2-S2.5 and S3.3 of the Supplementary Materials, that are used to identify same-day, one-day and two-day difference peer effects in running.



input data files: run_mat.mat

	   	  distance_mat.mat

	   	  duration_mat.mat

	   	  pace_mat.mat

	    	  TimeZone_mat.mat

	   	  StartTime_mat.mat 

	   	  PRECIPITATION_mat.mat

           	  TMAX_mat.mat

	   	  USERREL_USEDFOR_SOCIAL_INFLUENCE_wth_correlations.mat

	   	  App_Users_in_Graph_demographics.csv



output data file: ego_level_same_day_corr_thresh_0025_same_day.txt (only headers-redacted for legal reasons)

		  ego_level_same_day_corr_thresh_0025_one_day.txt (only headers-redacted for legal reasons)

 		  ego_level_same_day_corr_thresh_0025_two_day.txt (only headers-redacted for legal reasons)









*the STATA script exercise_influence.do replicates the results presented in tables S4-S7 in supplementary materials and in figure 2 of the main text:



input data files: ego_level_same_day_corr_thresh_0025_same_day.txt (only headers-redacted for legal reasons)

	   	  ego_level_same_day_corr_thresh_0025_one_day.txt (only headers-redacted for legal reasons)

	    	  ego_level_same_day_corr_thresh_0025_two_day.txt (only headers-redacted for legal reasons)







***************************************













***Interaction Models - Figure 3*** 



*The Matlab script Interaction_Model_A.m runs the scripts that generate the tables for the ego-level model with interactions A, as described in detail in sections S2.6 and S3.4 of the Supplementary Materials.



Interaction Model A:

We examine how a more or less active peer (compared to the ego's running activity) affects the ego's running and vice versa.

To do that, we first calculate the ratio (LAMBDA) of each peer's average running activity to the ego's average running activity, and we separate the peers into groups depending on this ratio. We consider 9 groups depending on the range of the ratio: LAMBDA=(-inf,1/16],(1/16,1/8],(1/8,1/4],(1/4,1/2],(1/2,2],(2,4],(4,8],(8,16],(16,inf). For each Ego i, we then split her/his neighborhood (peers j = 1:kit) into subsets depending on the value of LAMBDA and calculate the average running activity of each subset.
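
The grouping by LAMBDA can be sketched as follows. This is a Python illustration of the binning rule only, not the code in Interaction_Model_A.m. The function names and the sample averages are hypothetical, and the sketch assumes that the ego's average activity is strictly positive.

```python
from bisect import bisect_left

# Upper edges of the first eight right-closed intervals; the ninth is (16, inf).
LAMBDA_EDGES = [1/16, 1/8, 1/4, 1/2, 2, 4, 8, 16]
LAMBDA_LABELS = ["(-inf,1/16]", "(1/16,1/8]", "(1/8,1/4]", "(1/4,1/2]",
                 "(1/2,2]", "(2,4]", "(4,8]", "(8,16]", "(16,inf)"]


def lambda_group(peer_avg, ego_avg):
    """Return the index (0-8) of the LAMBDA group for one ego-peer pair.

    Assumes ego_avg > 0 so that the ratio is defined.
    """
    lam = peer_avg / ego_avg
    # bisect_left places a value equal to an edge in the interval it closes.
    return bisect_left(LAMBDA_EDGES, lam)


def split_neighborhood(ego_avg, peer_avgs):
    """Map group index -> list of peer ids, given {peer_id: average activity}."""
    groups = {}
    for peer_id, peer_avg in peer_avgs.items():
        groups.setdefault(lambda_group(peer_avg, ego_avg), []).append(peer_id)
    return groups


# Made-up averages for illustration.
groups = split_neighborhood(2.0, {"a": 0.1, "b": 2.0, "c": 4.0, "d": 40.0})
print({LAMBDA_LABELS[g]: ids for g, ids in groups.items()})
```

With these made-up numbers, peer "c" has LAMBDA = 2 and therefore falls in (1/2,2], because the intervals are closed on the right.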



input data files: 	run_mat.mat

	     		distance_mat.mat

	     		duration_mat.mat

	    		pace_mat.mat

			TimeZone_mat.mat

			StartTime_mat.mat 

	   	        PRECIPITATION_mat.mat

             		TMAX_mat.mat

	     		USERREL_USEDFOR_SOCIAL_INFLUENCE_wth_correlations.mat

	     		App_Users_in_Graph_demographics.csv



output data file:      Interaction_model_A.txt (only headers-redacted for legal reasons)













*The Matlab script Interaction_Model_B.m runs the code that generates the tables for the ego-level model with interactions B, as described in detail in sections S2.6 and S3.4 of the Supplementary Materials.



Interaction Model B:

We examine how individuals with different levels of activity influence each other, for example how two very active friends (or two mostly inactive friends) influence each other. To do that, we first separate all individuals into two categories, active (H) and inactive (L), by comparing their total running activity over the period of observation to the average running activity of all users. For each Ego i, we then split her/his neighborhood (peers j = 1:kit) into active (H) and inactive (L) peers and calculate the average running activity of each subset. We consider 4 different scenarios: Ego active (H) - friend active (H), Ego active (H) - friend inactive (L), Ego inactive (L) - friend active (H), Ego inactive (L) - friend inactive (L).
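
The H/L classification can be sketched as follows. This Python example is an illustration, not the code in Interaction_Model_B.m. It assumes that "average running activity of all users" means the mean of the per-user totals and that a user exactly at the mean is labeled L. Both are assumptions of this sketch, and the names and numbers are made up.

```python
def classify_activity(total_activity):
    """Label each user 'H' (active) or 'L' (inactive).

    `total_activity` maps user id -> total running activity over the
    observation period. A user is 'H' when above the mean across all users.
    """
    mean_activity = sum(total_activity.values()) / len(total_activity)
    return {user: ("H" if total > mean_activity else "L")
            for user, total in total_activity.items()}


def split_peers(ego, peers, labels):
    """Return (ego label, active peers, inactive peers) for one ego."""
    active = [p for p in peers if labels[p] == "H"]
    inactive = [p for p in peers if labels[p] == "L"]
    return labels[ego], active, inactive


# Made-up totals for illustration.
labels = classify_activity({"u1": 300.0, "u2": 20.0, "u3": 500.0, "u4": 60.0})
print(labels)
print(split_peers("u1", ["u2", "u3", "u4"], labels))
```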



input data files: 	run_mat.mat

	     		distance_mat.mat

	     		duration_mat.mat

	    		pace_mat.mat

			TimeZone_mat.mat

			StartTime_mat.mat 

	   	        PRECIPITATION_mat.mat

             		TMAX_mat.mat

	     		USERREL_USEDFOR_SOCIAL_INFLUENCE_wth_correlations.mat

	     		App_Users_in_Graph_demographics.csv



output data file:       Interaction_model_B.txt (only headers-redacted for legal reasons)









*The Matlab script Interaction_Model_C.m runs the scripts that generate the tables for the ego-level model with interactions C, as described in detail in sections S2.6 and S3.4 of the Supplementary Materials.



Interaction Model C: We are interested in identifying how stickiness with exercise affects exercise influence. For each Ego i, we split their neighborhood (peers j = 1:kit) into consistent and inconsistent peers and calculate the average running activity of each subset. We consider 4 different cases: ego consistent - friend consistent, ego consistent - friend inconsistent, ego inconsistent - friend consistent, ego inconsistent - friend inconsistent.



input data files: 	run_mat.mat

	     		distance_mat.mat

	     		duration_mat.mat

	    		pace_mat.mat

			TimeZone_mat.mat

			StartTime_mat.mat 

	   	        PRECIPITATION_mat.mat

             		TMAX_mat.mat

	     		USERREL_USEDFOR_SOCIAL_INFLUENCE_wth_correlations.mat

	     		App_Users_in_Graph_demographics.csv



output data files:      Interaction_model_C.txt (only headers-redacted for legal reasons)






*The Matlab script Interaction_Model_D.m runs the code that generates the tables for the ego-level model with interactions D, as described in detail in sections S2.6 and S3.4 of the Supplementary Materials.



Interaction Model D:

We are interested in identifying how gender affects exercise influence. For each Ego i, we split their neighborhood (peers j = 1:kit) into males and females and calculate the average running activity of each subset. We consider 4 different cases: ego male - friend male, ego male - friend female, ego female - friend male, ego female - friend female.



input data files: 	run_mat.mat

	     		distance_mat.mat

	     		duration_mat.mat

	    		pace_mat.mat

			TimeZone_mat.mat

			StartTime_mat.mat 

	   	        PRECIPITATION_mat.mat

             		TMAX_mat.mat

	     		USERREL_USEDFOR_SOCIAL_INFLUENCE_wth_correlations.mat

	     		App_Users_in_Graph_demographics.csv



output data file:       Interaction_model_D.txt (only headers-redacted for legal reasons)





*The STATA script interactions.do replicates the results presented in tables S8-S11 in supplementary materials and in Figure 3 of the main manuscript.



input data files: Interaction_model_A.txt (only headers-redacted for legal reasons)

		  Interaction_model_B.txt (only headers-redacted for legal reasons)

		  Interaction_model_C.txt (only headers-redacted for legal reasons)

		  Interaction_model_D.txt (only headers-redacted for legal reasons)



***********************************











***Structural Mechanisms - Figure 4*** 




*Structural Diversity: the Matlab script Structural_Diversity.m generates the ego-level tables associated with the complex contagion and structural diversity analysis, as described in subsections S2.7 and S3.5.2 of the Supplementary Materials.



In more detail, for each ego we extract the number of peers that are active on each day as well as the number of active connected components (i.e. the connected components in the ego's neighborhood in which at least one individual is active).
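
The count of active connected components can be sketched as follows. This Python example is an illustration, not the code in Structural_Diversity.m. It assumes an undirected friendship graph and that components are computed among the ego's peers with the ego itself removed. The graph and the set of active users are made up.

```python
def active_components(ego, friends, active):
    """Count active peers and active connected components around one ego.

    `friends` maps user id -> set of friend ids (undirected graph);
    `active` is the set of users who ran on the day in question.
    Components are computed among the ego's peers, with the ego removed.
    """
    peers = friends[ego]
    unseen = set(peers)
    n_active_components = 0
    while unseen:
        # Depth-first search over one component of the ego's neighborhood.
        stack = [unseen.pop()]
        has_active = False
        while stack:
            node = stack.pop()
            has_active = has_active or node in active
            for neighbor in friends[node] & unseen:
                unseen.discard(neighbor)
                stack.append(neighbor)
        n_active_components += has_active
    return len(peers & active), n_active_components


# Made-up graph: peers a and b know each other; c and d are isolated peers.
friends = {"e": {"a", "b", "c", "d"}, "a": {"e", "b"}, "b": {"e", "a"},
           "c": {"e"}, "d": {"e"}}
print(active_components("e", friends, active={"a", "b", "c"}))  # (3, 2)
```

In this example three peers are active, but they span only two components, {a, b} and {c}, while the component {d} is inactive.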







input data files: 	run_mat.mat

	     		distance_mat.mat

	    	        duration_mat.mat

	     		pace_mat.mat

			TimeZone_mat.mat

			StartTime_mat.mat 

	     		PRECIPITATION_mat.mat

             		TMAX_mat.mat

	     		USERREL_USEDFOR_SOCIAL_INFLUENCE_wth_correlations.mat

	     		App_Users_in_Graph_demographics.csv





output data file:       structural_diversity_data.txt (headers only-redacted for legal reasons)







*Embeddedness: the Matlab script MatlabScript_Embeddedness.m generates the data tables needed for the embeddedness analysis, as described in subsections S2.7 and S3.5.3 of the Supplementary Materials.



In more detail, for each ego we isolate the peers that share a common friend with the ego and calculate the average running activity of this sub-neighborhood. We do the same, separately, for the non-embedded neighborhood.
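
The split into embedded and non-embedded peers can be sketched as follows. This Python example is an illustration, not the code in MatlabScript_Embeddedness.m. It assumes an undirected friendship graph, and the graph and activity values are made up.

```python
def split_by_embeddedness(ego, friends):
    """Split an ego's peers into embedded and non-embedded sets.

    A peer is embedded if the ego and the peer share at least one common friend.
    """
    peers = friends[ego]
    embedded = {p for p in peers if (friends[p] & peers) - {ego, p}}
    return embedded, peers - embedded


def mean_activity(users, activity):
    """Average activity over `users`; None when the set is empty."""
    return sum(activity[u] for u in users) / len(users) if users else None


# Made-up graph: a and b are friends of each other and of the ego e.
friends = {"e": {"a", "b", "c", "d"}, "a": {"e", "b"}, "b": {"e", "a"},
           "c": {"e"}, "d": {"e"}}
activity = {"a": 5.0, "b": 3.0, "c": 0.0, "d": 2.0}

embedded, non_embedded = split_by_embeddedness("e", friends)
print(sorted(embedded), mean_activity(embedded, activity))          # ['a', 'b'] 4.0
print(sorted(non_embedded), mean_activity(non_embedded, activity))  # ['c', 'd'] 1.0
```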



input data files: 	run_mat.mat

	     		distance_mat.mat

	    	        duration_mat.mat

	     		pace_mat.mat

			TimeZone_mat.mat

			StartTime_mat.mat 

	     		PRECIPITATION_mat.mat

             		TMAX_mat.mat

	     		USERREL_USEDFOR_SOCIAL_INFLUENCE_wth_correlations.mat

	     		App_Users_in_Graph_demographics.csv



output data files:      embedded_data.txt (only headers-redacted for legal reasons)







*the STATA script structdiversity_embeddedness.do replicates the results presented in tables S13, S15, S16 and S17 in supplementary materials (Figure 4 of the main manuscript).



input data files: nonembedded_data.txt (only headers-redacted for legal reasons)

		  structural_diversity_data.txt (only headers-redacted for legal reasons)

**************************************





******************************************************************************


